Week 8: Machine Learning 2

SECU0057

Bennett Kleinberg

5 Mar 2020


Applied Data Science

Week 8: Machine Learning 2

Today

  • unsupervised learning
  • core algorithm in detail
  • problems of unsupervised learning

Unsupervised ML

Problem for supervised approaches

  • most of the time we don’t have labelled data
  • sometimes there are no labels at all
  • core idea: finding clusters in the data

Examples

  • grouping of online ads
  • clusters in crime descriptions
  • collections of texts without authors

Practically all interesting problems are unlabelled data problems.

The unsupervised case

Aim

  • examining whether there are patterns (e.g. groups in the data)
  • possibly: a ‘grouped’ underlying data generation process
  • helpful because: reduces dimensions of the data

How to test whether there are patterns?

  1. separate data into a set number of clusters
  2. find the best cluster assignment of observations

Common method: k-means algorithm

1. Setting k

Let’s take \(k=4\).

unsup_model_1 = kmeans(data4
                       , centers = 4
                       , nstart = 10
                       , iter.max = 10)

Assigning cluster membership

Iterative algorithm

What happened in the iterations?

The k-means algorithm in detail

  • set random centroids in n-dimensional space
  • assign each observation to its closest centroid
  • find new centroids
  • re-assign the observations
  • (iterative approach)
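The steps above can be sketched in a few lines of base R. This is a toy illustration only (the data and `k = 2` are made up; in practice you would use `kmeans()` as in the slides, which also handles edge cases such as empty clusters):

```r
# toy k-means on two well-separated 2-dimensional blobs
set.seed(123)
data_toy = rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                 matrix(rnorm(20, mean = 5), ncol = 2))

k = 2
# 1. set random centroids (here: k random observations)
centroids = data_toy[sample(nrow(data_toy), k), ]

for(iteration in 1:10){
  # 2. assign each observation to its closest centroid (Euclidean distance)
  distances = sapply(1:k, function(j){
    sqrt(rowSums((data_toy - matrix(centroids[j, ], nrow(data_toy), 2, byrow = TRUE))^2))
  })
  assignment = apply(distances, 1, which.min)
  # 3. find new centroids = the mean of each cluster's points
  new_centroids = t(sapply(1:k, function(j){
    colMeans(data_toy[assignment == j, , drop = FALSE])
  }))
  # 4. stop at convergence (centroids no longer move), else re-assign
  if(all(new_centroids == centroids)) break
  centroids = new_centroids
}
```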

Assigning cluster membership

Obtaining distances (errors)

Distance metric

  • typically: Euclidean distance
  • \(dist(p, c) = \sqrt{(p_1 - c_1)^2 + (p_2 - c_2)^2}\)

\(dist(p[1,1], c[2,3]) = \sqrt{(1 - 2)^2 + (1 - 3)^2} = \sqrt{5} = 2.24\)

Objective: \(\arg \min D(p_i, c_j)\)
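The worked example above can be checked directly in R (point and centroid values taken from the slide):

```r
p = c(1, 1)   # point
c1 = c(2, 3)  # centroid

# Euclidean distance between point and centroid
sqrt(sum((p - c1)^2))
## [1] 2.236068
```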

After distance-based assignment

New centroids: k-MEANS

X  Y  Cluster
1  1  red
1  3  red
4  1  blue
5  5  blue

\(Mx_{red} = \frac {1+1}{2} = 1\)

\(My_{red} = \frac {1+3}{2} = 2\)

\(M_{red} = [1, 2]\)
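In R, the new centroid of the red cluster is simply the column means of its points (data taken from the table above):

```r
points = data.frame(X = c(1, 1, 4, 5),
                    Y = c(1, 3, 1, 5),
                    Cluster = c("red", "red", "blue", "blue"))

# new centroid = mean of the points currently assigned to the cluster
colMeans(points[points$Cluster == "red", c("X", "Y")])
## X Y 
## 1 2
```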

New centroids

Iteration after iteration

Cluster membership after iteration 2

Stopping rule

If any of these apply:

  • convergence (i.e. no points change cluster membership)
  • max. number of iterations (iter.max = ...)
  • distance threshold reached

What’s strange about our approach?

How do we know k?

Possible approach:

  • run it for \(n\) values: \(k=1, k=2, \ldots, k=n\)
  • assess how good k is

What does “good” mean?

Determining k

WSS = within (cluster) sum of squares

  • take the difference between each point \(x_{i,j}\) and its cluster centroid \(c_j\)
  • remember: \(c_j\) is now the mean of all points \(x_{i,j}\) assigned to cluster \(j\)
  • so: we square these differences and sum them

\(\arg \min \sum\limits_{j} \sum\limits_{x_{i,j}} (x_{i, j} - c_j)^2\)
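This quantity can be retrieved from a fitted model and computed by hand as a check (a sketch; `data4` is the dataset used throughout these slides, and floating-point rounding means the comparison should use `all.equal()` rather than `==`):

```r
kmeans_model = kmeans(data4, centers = 2, nstart = 10)

# WSS by hand: squared distance of each point to its cluster centroid
wss_by_hand = sum(sapply(1:2, function(j){
  cluster_points = data4[kmeans_model$cluster == j, , drop = FALSE]
  centroid = kmeans_model$centers[j, ]
  sum(sweep(cluster_points, 2, centroid)^2)
}))

all.equal(wss_by_hand, kmeans_model$tot.withinss)
```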

Cluster determination

wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss
}

For \(k = 1, \ldots, 20\)

wss
##  [1] 1998.00000  799.23145  529.42464  405.14898  341.16308  293.44305
##  [7]  256.25549  226.13568  201.62530  181.03906  163.43303  152.20691
## [13]  143.17168  133.78717  124.50437  117.49929  111.04724  102.77820
## [19]   97.30524   91.73814

Scree plot (= the elbow method)
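With the `wss` vector from the loop above, the scree plot is a single plotting call; one then looks for the "elbow" where adding further clusters stops paying off:

```r
plot(1:20, wss,
     type = "b",  # points connected by lines
     xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squares")
```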

Other methods to establish k

  • Silhouette method (cluster fit)
  • Gap statistic
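As a sketch of the silhouette method (assuming the `cluster` package is installed and `data4` is the dataset from above): average silhouette widths close to 1 indicate well-separated clusters, so one picks the \(k\) with the highest average width.

```r
library(cluster)

avg_sil = numeric()
for(i in 2:10){
  km = kmeans(data4, centers = i, nstart = 10)
  # silhouette width per observation, based on pairwise distances
  sil = silhouette(km$cluster, dist(data4))
  avg_sil[i] = mean(sil[, "sil_width"])
}
which.max(avg_sil)  # candidate for k
```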


Silhouette method

Gap statistic

Applying k-means clustering

We settle for \(k = 2\)

unsup_model_final = kmeans(data4
                       , centers = 2
                       , nstart = 10
                       , iter.max = 10)

Plot the cluster assignment
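A minimal base-R version of that plot, colouring each observation by its assigned cluster and marking the centroids (assuming `data4` has two columns, as in these slides):

```r
plot(data4,
     col = unsup_model_final$cluster,  # colour by cluster assignment
     pch = 19)
points(unsup_model_final$centers,
       col = 1:2, pch = 4, cex = 2, lwd = 3)  # centroids as crosses
```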

Other unsupervised methods

  • k-means (today)
  • hierarchical clustering
  • density clustering

Issues with unsupervised learning

What’s lacking?

What can you (not) say?

Caveats of unsup. ML

  • there is no “ground truth”
  • interpretation/subjectivity
  • cluster choice

Interpretation of findings

Interpretation of findings

unsup_model_final$centers
##       salary     height
## 1 -0.7474895 -0.7551138
## 2  0.7937260  0.8018218
  • Cluster 1: lower salary, shorter
  • Cluster 2: higher salary, taller
  • People in cluster 1 earn less and are shorter than those in cluster 2

We cannot say more than that!

Interpretation of findings

Interpretation of findings

  • subjective
  • labelling tricky
  • researcher’s choice!
  • be open about this

Cluster choice

What if we chose \(k=3\)?

When k changes, the interpretation changes

km_3$centers
##       salary     height
## 1 -1.1253285 -0.7403048
## 2  0.7959880  1.1611042
## 3  0.4627853 -0.4561074

Interpretation for k=3

  • Cluster 1: very low salary, short
  • Cluster 2: high salary, very tall
  • Cluster 3: avg-to-high salary, somewhat short

Cluster choice

  • be open about it
  • make all choices transparent
  • always share code and data (“least vulnerable” principle)

Important

Note: without ground truth, we cannot say anything about accuracy.

Contrast this with the (supervised) k-NN model, where predictions can be checked against labels.

Bigger picture of machine learning

  • covered so far: supervised + unsupervised learning
  • next week: neural networks

How do supervised and unsupervised learning relate to each other?

Case example

  • suppose you want to measure hate speech in the UK
  • on Twitter
  • and you have 10m Tweets of interest

Possible approach

  • you craft rules to determine hate speech vs non-hate speech
  • problematic: might not capture all dynamics + costly

Better: supervised machine learning (text classification)

Text classification approach

  • you annotate some data (typically crowdsourced)
  • you build a supervised learning model
  • with proper train-test splitting
  • and assess the model with \(Pr_{hatespeech}\)

Suppose you have a good enough model.

Remember

  • the aim was to measure hate speech in the UK
  • your model should now be good to annotate unlabelled data
  • i.e. you can use the model on all Tweets
  • and then answer the RQ

What’s next?

  • Today’s tutorial + homework: unsupervised learning in R

Next week: Machine Learning 3